Abstract


Real-world data are long-tailed, and the lack of tail samples significantly limits the generalization
ability of the model. Although numerous class re-balancing approaches perform well for
moderate class imbalance problems, additional knowledge needs to be introduced to help the tail class 
recover the underlying true distribution when the observed distribution from a few tail samples does 
not represent its true distribution properly, thus allowing the model to learn valuable information out- 
side the observed domain. In this work, we propose to leverage the geometric information of the feature 
distribution of the well-represented head class to guide the model to learn the underlying distribution 
of the tail class. Specifically, we first systematically define the geometry of the feature distribution and 
the similarity measures between the geometries, and discover four phenomena regarding the relation- 
ship between the geometries of different feature distributions. Then, based on four phenomena, feature 
uncertainty representation is proposed to perturb the tail features by utilizing the geometry of the head 
class feature distribution. It aims to make the perturbed features cover the underlying distribution of 
the tail class as much as possible, thus improving the model’s generalization performance in the test 
domain. Finally, we design a three-stage training scheme enabling feature uncertainty modeling to be 
successfully applied. Experiments on CIFAR-10/100-LT, ImageNet-LT, and iNaturalist2018 show that 
our proposed approach outperforms other similar methods on most metrics. In addition, the experi- 
mental phenomena we discovered are able to provide new perspectives and theoretical foundations for 
subsequent studies. The code will be available at https://github.com/mayanbiao1234/Geometric-Prior


Keywords: Long-Tailed Classification, Representational learning, Geometric prior knowledge 


1 Introduction 


Deep learning has made significant progress in 
image classification, image segmentation, and 
other fields benefiting from artificially annotated 
large-scale datasets. However, real-world data 
tends to follow a long-tailed distribution [33], and 
unbalanced classes introduce bias into machine 
learning models. Numerous approaches have been 


proposed to mitigate the model bias, such as class 
re-balancing [6, 12, 50], information augmenta- 
tion [4, 20, 44, 49] and network structure design 
[20, 45, 50]. However, the above approaches do not
work effectively in all cases, and the generalization 
ability of the model will be greatly limited when 
the samples of the tail class do not accurately rep- 
resent its true distribution. We discuss two cases 
Fig. 1 (a) When the samples uniformly cover the true
data distribution, the model can learn the correct decision 
boundaries and can correctly classify unfamiliar samples 
to be tested. (b) When the samples cover only a portion 
of the true distribution, unfamiliar samples to be tested 
are highly likely to be misclassified due to the error in the 
decision boundary. (c) The direction in which the arrow 
points is the best direction to expand the sample. 


of the relationship between the observed and true 
distributions of the tail classes [5]. 


• Case 1: The observed samples cover the true
data distribution uniformly (as shown in Figure
1a).

• Case 2: The observed samples cover only a small
region of the true data distribution (as shown
in Figure 1b).


In case 1, although the sample size of the tail 
classes is small, these samples represent the true 
data distribution. The main reason for the degra- 
dation of the model performance is that at each 
sampling, samples from the tail class are used with 
a small probability to calculate loss and update 
parameters, resulting in inadequate learning of the 
tail class by the model. Faced with this situa- 
tion, existing data augmentation methods [4, 41], 
undersampling [36], oversampling [37, 44], and re- 
balancing loss [19, 23, 27] can reasonably improve 
performance. Combining decoupled training with 
the above approach can improve the performance 
of the model even further [1, 45]. However, neither
rebalancing the sample size nor rebalancing the loss
adds information from outside the observed training
domain. When certain classes are severely
underrepresented (i.e., Case 2), these methods have
difficulty finding the right direction for adjusting
the decision boundaries, so they can even degrade
performance [5]. Therefore, we seek to mine knowledge from
the well-represented head classes to help recover 
the true distribution of the tail classes. 

In case 2, if the underlying true distribution 
of the tail classes cannot be recovered, then the 


model always fails to learn the correct decision 
boundaries. Even if the model achieves high recog- 
nition accuracy in the training set, it still fails 
to have satisfactory generalization performance 
when faced with test samples outside the training 
domain. Therefore, if the direction for recovering 
the true distribution of tail classes, such as the 
direction indicated by the three arrows in Figure
1c, can be found, the generalization ability of
the model on the tail classes will be significantly 
improved. It is necessary to explore additional 
knowledge to guide the tail class to recover the 
true distribution. However, the head-to-tail knowl- 
edge transfer methods [1, 14, 26, 34] currently 
proposed for the long-tailed challenge do not yet 
address this issue, so recovering the underlying 
true distribution using few samples is a meaningful 
and extremely difficult challenge. 

It has been shown that the model bias is caused 
by the classifier and the long-tailed data does not 
unbalance the feature representation learning [12, 
50]. Also considering that the dimensionality of 
the feature space is smaller than that of the sample 
space, we focus on recovering the true distribution 
of tail classes in the feature space. In this work, 
our main contributions are summarized as follows. 


• We systematically define the geometry of the
feature distribution and the similarity mea- 
sure between the geometries (Section 3.1 and 
3.2). Based on this, four surprising experimen- 
tal phenomena are found which can be used to 
guide and recover the true distribution of the 
tail classes (Section 3.3). The most important 
phenomenon is that similar feature distribu- 
tions have similar geometries and the similarity 
between the geometries of the feature distri- 
butions decreases as the interclass similarity 
decreases. We introduce a geometric perspective 
to recover underrepresented class distributions, 
providing a theoretical and experimental basis 
for subsequent studies of class imbalance. 

• Based on four experimental phenomena, we pro-
pose to model the uncertainty representation 
of the tail features with geometric information 
from the feature distribution of the head class 
(Section 4.1). Specifically, instead of treating 
samples in the feature space as deterministic 
points, we perturb them to make the model 
learn information outside the observed domain 
by taking into account the geometry of the 
class to which the samples belong. Our proposed
feature uncertainty modeling can effectively 
alleviate the model bias introduced by under- 
represented classes and can be easily integrated 
into existing networks. 

• We propose a three-stage training scheme
to apply feature uncertainty representation 
(Section 4.2). The results of the ablation exper- 
iments show that compared to decoupled train- 
ing, the three-stage training scheme improves 
the tail class performance while reducing the 
degradation of the head class performance, 
resulting in more overall performance improve- 
ment of the model. 

• Experiments on large-scale long-tailed datasets
(Section 5) show that our proposed method 
significantly improves the performance of tail 
classes and exhibits state-of-the-art results com- 
pared to other similar methods. 


2 Related Work 
2.1 Class Rebalancing 


The extreme imbalance in the number of sam- 
ples in the long-tail data prevents the classification 
model from learning the distribution of the tail 
classes adequately, which leads to poor perfor- 
mance of the model on the tail classes. Therefore, 
methods to rebalance the number of samples and 
the losses incurred per class (i.e., resampling and 
reweighting) are proposed. Resampling methods 
are divided into oversampling and undersampling 
[3, 8, 9, 13, 35, 47]. The idea of oversampling 
is to randomly sample the tail classes to equal- 
ize the number of samples and thus optimize 
the classification boundaries. The undersampling 
methods balance the number of samples by ran- 
domly removing samples from the head classes. 
For example, [36] finds that training with a bal- 
anced subset of a long-tailed dataset is instead 
better than using the full dataset. In addition, 
[12, 50] fine-tune the classifier via a resampling 
strategy in the second phase of decoupled train- 
ing. [37] continuously adjusts the distribution of 
resampled samples and the weights of the two-loss 
terms during training to make the model per- 
form better. [44] employs the model classification 
loss from an additional balanced validation set to 
adjust the sampling rate of different classes. 


The purpose of reweighting loss is intuitive, 
and it is proposed to balance the losses incurred 
by all classes, usually by applying a larger penalty 
to the tail classes on the objective function (or loss 
function) [7, 10, 30, 31, 38, 48]. [27] proposes to 
adjust the loss with the label frequencies to allevi- 
ate class bias. [19] not only assigns weights to the 
loss of each class, but also assigns higher weights 
to hard samples. Recent studies have shown that 
the effect of reweighting losses by the inverse of 
the number of samples is modest [23, 25]. Some 
methods that produce smoother weights for
reweighting perform better, such as taking the 
square root of the number of samples as the weight 
[24]. [6] attributes the better performance of this 
smoother method to the existence of marginal 
effects. In addition, [1] proposes to learn the classifier
with class-balanced loss by adjusting the
weight decay and MaxNorm in the second stage. 


2.2 Stage-wise training 


Decoupling [12] first proposes to decouple the 
learning process on long-tail data into feature 
learning and classifier learning, and it finds that 
re-learning the balanced classifier can significantly 
improve the model performance. Further, BBN 
[50] combines the two-stage learning into a two- 
branch model. The two branches of the model 
share parameters, with one branch learning using 
the original data and the other learning using the 
resampled data. [5] decomposes the features into 
class-generic features and class-specific features, 
and it expands the tail class data by combining 
class-generic features of the head class with class- 
specific features of the tail class. [49] finds that 
augmenting data with Mixup in the first stage 
benefits feature learning and does negligible dam- 
age to classifiers trained using decoupling. [45] 
also observes that long-tailed data does not affect 
feature learning, and it proposes an adaptive cal- 
ibration function for improving the cross-entropy 
loss. [11] considers the effect of noisy samples 
on the tail class and adaptively assigns weights 
to the tail class samples by meta-learning in the 
second stage. Different from the above two-stage 
training process, we propose a three-stage training 
scheme. The first two stages are indistinguishable 
from decoupling training, and in the third stage, 
the classifier parameters are fixed and the feature 
extractor is fine-tuned to adapt to the improved
classification boundaries. 



2.3 Head-to-tail knowledge transfer 


Head-to-tail knowledge transfer is more rel- 
evant to our work than other methods. [43] 
and [21] were first proposed in the face recognition 
field to transfer variance between classes to augment
classes with fewer samples. [43] assumes that the
feature distribution of each class is multivariate
Gaussian and that the common and under-represented
classes share the same variance, so the variance of the
head class is used to estimate the distribution of the
tail class.
[21] assumes that the intra-class angle distribu- 
tion follows a Gaussian distribution, transfers the 
intra-class angle distribution of features to the tail 
class, and constructs a "feature cloud" for each
feature to extend the distribution of the tail class. 

Similar to the adversarial attack, [14] proposes 
to transform some of the head samples into tail 
samples through perturbation-based optimization 
to achieve tail class augmentation. [5] decomposes 
the features of each class into class-generic fea- 
tures and class-specific features. During training, 
the class-specific features of the tail class are fused 
with the generic features of the head class to gen- 
erate new features to expand the tail class. This 
idea is similar to the data augmentation in image 
space, such as Cutmix. [34] dynamically estimates 
a set of centers for each class, and then calculates 
the displacement between the head class feature 
and the corresponding nearest intra-class center. 
This displacement is used to combine with the 
tail class centers to generate new features, thereby 
increasing the feature diversity of the tail class. 
[20] proposes to transfer the geometric information 
of the feature distribution boundaries of the head 
class to the tail class by enhancing the weights 
of the tail class classifier. The recently proposed 
CMO [26] considers that the image of the head 
class has a rich background, so the image of the 


tail class can be pasted directly onto the back- 
ground image of the head class to increase the 
richness of the tail class. This method can be eas- 
ily combined with other long-tailed recognition 
methods. 

Previous motivations for head-to-tail knowl- 
edge transfer were limited to qualitative analysis 
or conjecture. Unlike the above studies, we take a
geometric perspective on head-to-tail knowledge
transfer. We systematically define
the geometry of the distribution and its similarity 
measure and find direct evidence that the geome- 
try of the head class distribution can help the tail 
class. 


3 Motivation 


We first define a measure of the geometry of 
the feature distribution, and then propose a sim- 
ilarity measure between the geometries. Finally, 
across several benchmark data sets, we discov- 
ered four experimental phenomena regarding the 
relationship between geometric information of fea- 
ture distributions. Inspired by the experimental 
phenomena, we propose to utilize the feature distribution
of the head class to help the tail class to
recover the underlying distribution. 


3.1 The Geometry of Data 
Distribution 


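In the $P$-dimensional space, given data $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{P \times n}$ that belongs to the same
class, the sample covariance matrix of $X$ can be
estimated as

$$\Sigma_X = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} x_i x_i^T\right] = \frac{1}{n} X X^T \in \mathbb{R}^{P \times P}.$$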
x Da n E 


If $\Sigma_X = I_P$, where $I_P$ denotes the unit matrix
of order $P$, the distribution of $X$ is said
to be isotropic; otherwise it is said to be
anisotropic. In practice, the data distribution
is usually anisotropic. Considering the two-dimensional
case, we can find two vectors $\xi_1$ and
$\xi_2$, where $\xi_1$ points to the direction with the
largest sample variance, and $\xi_2$ points to the direction
with the largest variance among the directions
orthogonal to $\xi_1$. $\xi_1$ and $\xi_2$ can be used to anchor
the geometry of the two-dimensional distribution.
In the high-dimensional case, since $\Sigma_X$ is a
real symmetric matrix, its eigenvectors can be chosen
to be mutually orthogonal, and $\xi_i$ points to
the direction with the $i$-th largest variance. Analogously
to the two-dimensional case, we can use
all the eigenvectors of $\Sigma_X$ to anchor the geometry
of the distribution.


Definition 1 (The geometry of data distribution).
Given a $P$-dimensional sample set $X$
and the corresponding covariance matrix $\Sigma_X$, the
eigendecomposition of $\Sigma_X$ yields $P$ eigenvalues
$\{\lambda_1, \lambda_2, \dots, \lambda_P\}$ and the corresponding $P$-dimensional
eigenvectors $[\xi_1, \xi_2, \dots, \xi_P] \in \mathbb{R}^{P \times P}$. All eigenvectors
of $\Sigma_X$ are considered as bones to anchor the geometry
of the distribution of $X$, denoted as

$$GD_X(\xi_1, \xi_2, \dots, \xi_P),$$

where $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_P \geq 0$ and $\|\xi_i\|_2 = 1$, $i = 1, 2, \dots, P$.
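A minimal NumPy sketch of Definition 1 may help make the construction concrete. It assumes samples are stored as columns of a `(P, n)` array and, following the covariance formula in Section 3.1, applies no mean-centering; the function name `geometry` and the synthetic data are illustrative and not part of the original work.

```python
import numpy as np

def geometry(X):
    """Return eigenvalues (descending) and unit eigenvectors (as columns)
    of the sample covariance (1/n) X X^T, for X of shape (P, n)."""
    n = X.shape[1]
    cov = X @ X.T / n                        # (P, P) covariance estimate
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder so lambda_1 >= ... >= lambda_P
    return eigvals[order], eigvecs[:, order]

# Synthetic anisotropic data: each of the 4 dimensions has a different scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 500)) * np.array([[3.0], [2.0], [1.0], [0.5]])
lam, xi = geometry(X)   # xi[:, i] plays the role of the i-th "bone" of GD_X
```

Because `np.linalg.eigh` is designed for symmetric matrices, the returned eigenvectors are orthonormal, which matches the unit-norm condition in the definition.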


3.2 Similarity Measure of Geometry 


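In the $P$-dimensional space, given two types of
data $X_1 = [x_1, \dots, x_n] \in \mathbb{R}^{P \times n}$ and $X_2 = [x_1, \dots, x_m] \in \mathbb{R}^{P \times m}$, their sample covariance
matrices are estimated as $\Sigma_{X_1} = \frac{1}{n} X_1 X_1^T \in \mathbb{R}^{P \times P}$ and $\Sigma_{X_2} = \frac{1}{m} X_2 X_2^T \in \mathbb{R}^{P \times P}$, respectively.
Performing the eigendecomposition on $\Sigma_{X_1}$
and $\Sigma_{X_2}$, the geometries of the distributions of $X_1$
and $X_2$ are denoted as $GD_{X_1}(\xi_{X_1}^1, \dots, \xi_{X_1}^P)$ and
$GD_{X_2}(\xi_{X_2}^1, \dots, \xi_{X_2}^P)$, respectively, where $\xi_{X_1}^i$ and
$\xi_{X_2}^j$ ($i, j = 1, 2, \dots, P$) are the eigenvectors of $\Sigma_{X_1}$
and $\Sigma_{X_2}$, respectively.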


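Definition 2 (Similarity metric between geometry).
Given the geometry of two distributions
$GD_{X_1}(\xi_{X_1}^1, \dots, \xi_{X_1}^P)$ and $GD_{X_2}(\xi_{X_2}^1, \dots, \xi_{X_2}^P)$, their
similarity is defined as

$$S(GD_{X_1}, GD_{X_2}) = \sum_{i=1}^{P} \left| \langle \xi_{X_1}^i, \xi_{X_2}^i \rangle \right| = \sum_{i=1}^{P} \left| {\xi_{X_1}^i}^T \xi_{X_2}^i \right|.$$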

The larger $S(GD_{X_1}, GD_{X_2})$, the more similar
the geometry of the distributions of $X_1$ and $X_2$. The
upper and lower bounds of $S(GD_{X_1}, GD_{X_2})$ are

$$0 \leq S(GD_{X_1}, GD_{X_2}) \leq P.$$

When every pair of corresponding eigenvectors $\xi_{X_1}^i$ and $\xi_{X_2}^i$
is collinear, $S(GD_{X_1}, GD_{X_2})$ reaches the upper
bound $P$. When every pair of corresponding eigenvectors $\xi_{X_1}^i$ and
$\xi_{X_2}^i$ is orthogonal, $S(GD_{X_1}, GD_{X_2})$ takes the
lower bound value 0. Taking the two-dimensional
distribution as an example, since


$$0 \leq |\phi_{R1}^T \phi_{B1}| + |\phi_{R2}^T \phi_{B2}| \leq |\xi_{R1}^T \xi_{B1}| + |\xi_{R2}^T \xi_{B2}| \leq 2,$$


it is clear that the geometry of the two distribu- 
tions in Figure A1 is more similar compared to the 
two distributions shown in Figure A2. The details 
are described in Appendix A. 
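The similarity measure of Definition 2 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the inner products are taken in absolute value so that the result stays within the stated bounds $[0, P]$ regardless of eigenvector sign, eigenvectors are stored as columns sorted by descending eigenvalue, and the names `eigvecs_desc` and `geometry_similarity` are hypothetical.

```python
import numpy as np

def eigvecs_desc(X):
    """Eigenvectors (columns) of (1/n) X X^T, sorted by descending eigenvalue."""
    cov = X @ X.T / X.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argsort(vals)[::-1]]

def geometry_similarity(xi_1, xi_2, top=None):
    """Sum of |<xi_1^i, xi_2^i>| over the leading `top` eigenvector pairs."""
    top = xi_1.shape[1] if top is None else top
    dots = np.sum(xi_1[:, :top] * xi_2[:, :top], axis=0)  # pairwise inner products
    return float(np.abs(dots).sum())

# Boundary cases: collinear pairs reach P, orthogonal pairs reach 0.
identity = np.eye(3)
s_same = geometry_similarity(identity, -identity)                     # sign flips do not matter
s_orth = geometry_similarity(identity, np.roll(identity, 1, axis=1))  # cyclically shifted axes

# Two independent samples drawn from the same anisotropic distribution.
rng = np.random.default_rng(0)
scales = np.array([[3.0], [1.0], [0.3]])
X1 = rng.normal(size=(3, 1000)) * scales
X2 = rng.normal(size=(3, 1000)) * scales
s_data = geometry_similarity(eigvecs_desc(X1), eigvecs_desc(X2))
```

The `top` argument restricts the sum to the leading eigenvectors, which is how the measure is used later when only the first few directions are compared.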


3.3 Four Discoveries about the 
Geometry of the Feature 
Distribution 


We first define the class similarity measure, and then
introduce the four phenomena we found and their
experimental setup.


Definition 3 Given a sample set $D_c = \{\dots, (x_i, y_c), \dots\}$ of class $c$, the average prediction
score $\frac{1}{|D_c|}\sum_i P(y_c \mid x_i, \theta)$ of all samples belonging to
class $c$ is calculated using a deep neural network with
trained parameters $\theta$, where $|D_c|$ denotes the sample
number of class $c$. Define the class

$$h := \arg\max_{k \neq c} \left( \frac{1}{|D_c|} \sum_i P(y_c \mid x_i, \theta) \right)_k$$

that is most similar to class $c$, i.e., the class with the
largest logit other than class $c$. Further, the similarity
ranking can be done based on logit.
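As a rough illustration of Definition 3, the sketch below averages the model's prediction scores over the samples of class $c$ and ranks the remaining classes by that average. It assumes the scores are available as an array `probs` of shape `(num_samples, num_classes)`; the function name and the toy numbers are made up for the example.

```python
import numpy as np

def similarity_ranking(probs, labels, c):
    """Rank all classes other than c by the average score that samples of
    class c receive; the first entry is the most similar class h."""
    mean_scores = probs[labels == c].mean(axis=0)   # average over D_c
    order = np.argsort(mean_scores)[::-1]           # highest average score first
    return [int(k) for k in order if k != c]

# Toy prediction scores for 4 samples and 3 classes (hypothetical values).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
labels = np.array([0, 0, 1, 2])
ranking = similarity_ranking(probs, labels, c=0)
most_similar = ranking[0]
```

Whether softmax probabilities or raw logits are averaged is left open here; the ranking logic is the same in both cases.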


We investigated the relationship between class 
similarity and the geometry similarity of class dis- 
tributions on two benchmark datasets: Fashion 
MNIST [39] and CIFAR-10 [15]. ResNet-18 [40] 
was adopted as the backbone network and various 
training schemes were applied to make the perfor- 
mance of ResNet-18 on the two datasets compara- 
ble to the state-of-the-art results (See Appendix B 
for details). First, the similarity between all classes 
on the two datasets is calculated and ranked. 
Then, we extracted 64-dimensional features of all 
samples from both datasets using ResNet18 and 
calculated the geometry of all class feature dis- 
tributions. Based on this, we summarize further 
experiments and findings as follows. 


3.3.1 Phenomenon 1 


As shown in Figure 2, features were extracted 
using trained ResNet-18 on MNIST [16], Fashion 
Fig. 2 The ratio of the sum of the top five eigenvalues to
the sum of all eigenvalues after eigendecomposition for the 
feature embeddings of all classes in the three datasets. The 
horizontal coordinates are the indexes of the classes, and 
the specific class names are in Appendix D. 


MNIST and CIFAR-10. We find the sum of the 
eigenvalues corresponding to the first five eigen- 
vectors that are used to represent the geometry of 
the distribution can reach more than 80% of the 
sum of all eigenvalues, which means that most of 
the information of the data distribution can be 
recovered along the first five eigenvectors. 
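The ratio behind Phenomenon 1 is straightforward to compute. The following sketch uses synthetic 64-dimensional features whose variance is deliberately concentrated in five directions, so it only illustrates the calculation and does not reproduce the reported measurements; `top_k_eigenvalue_ratio` is a hypothetical helper name.

```python
import numpy as np

def top_k_eigenvalue_ratio(features, k=5):
    """Share of the total eigenvalue mass carried by the k largest eigenvalues
    of (1/n) F F^T, for features F of shape (P, n)."""
    cov = features @ features.T / features.shape[1]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh is ascending; reverse it
    return float(eigvals[:k].sum() / eigvals.sum())

# Synthetic stand-in for one class's features: 5 dominant directions out of 64.
rng = np.random.default_rng(0)
scales = np.concatenate([np.full(5, 5.0), np.full(59, 0.3)])
features = rng.normal(size=(64, 2000)) * scales[:, None]
ratio = top_k_eigenvalue_ratio(features, k=5)
```

Applied to real per-class embeddings, a ratio above 0.8 would correspond to the observation described above.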


3.3.2 Phenomenon 2 


Based on the above observations, we set P in 
S(GDx,,GDx,) to 5 and calculate the similarity 
between geometry of all class feature distribu- 
tions in Fashion MNIST and CIFAR-10 and plot 
them in Figure 3a and Figure 3b. We find that 
if two classes have high similarity, then the 
geometry of their feature distributions also 
exhibit high similarity. And as the similarity 
between classes decreases, the similarity between 
the geometry of the class feature distributions 
shows a decreasing trend. Take dog in CIFAR-10
as an example: its most and least similar
classes are cat and automobile, respectively, and
the geometries of the three classes are represented
by $GD_{dog}(\xi_1, \dots, \xi_{64})$, $GD_{cat}(\eta_1, \dots, \eta_{64})$ and
$GD_{automobile}(\zeta_1, \dots, \zeta_{64})$. Calculate the matrices $M1$
and $M2$ and plot them in Figure 3c and Figure
3d, where $M1_{ij} = \langle \xi_i, \eta_j \rangle$ and $M2_{ij} = \langle \xi_i, \zeta_j \rangle$
($i, j = 1, \dots, 64$). It can be observed that $M1$
is closer to a diagonal matrix compared to M2, 
which corresponds to a more similar geometry of 
dog and cat. 


[Figure 3 image: heatmap panels titled Fashion MNIST and CIFAR-10 (horizontal axis: similarity rank 1 to 9) and inner-product matrix panels, including dog-automobile.]

Fig. 3 (a) The horizontal coordinates are the indexes 
of the classes, and 1 to 9 indicate the classes that are 
most similar to the class represented by the vertical coor- 
dinates to the least similar, respectively. Each element 
represents the similarity of the geometry between classes. 
See Appendix D for detailed class names. (b) Same as (a). 
(c) The inner product between all eigenvectors of dog and 
all eigenvectors of cat in CIFAR-10. The sum of the first 
five diagonal elements of M1 is equal to the value of the 
element in the first column of the first row in (b). (d) The 
inner product between all eigenvectors of dog and automo- 
bile in CIFAR-10. 


To show that the above phenomenon does not
occur by chance, we give the probability that the
experimental results in Figure 3c occur randomly.
Given two random vectors in a $P$-dimensional
space, let their inner product be $\delta \in [0, 1]$. The
probability density function of $\delta$ satisfies

$$f_P(\delta) \propto \left(1 - \delta^2\right)^{\frac{P-3}{2}}.$$

The detailed derivation and proof process of
the above equation is shown in the Appendix C.
Setting $P$ in $f_P(\delta)$ to 64, when $\delta$ is taken as the
first five diagonal elements of $M1$ respectively,
the calculation result of $f_P(\cdot)$ is shown in Figure
4a. Considering only the first five diagonal elements
of $M1$, the probability of the situation
shown in Figure 3c occurring is almost 0. Not only
that, we observed numerous such phenomena (see 
Appendix D), thus implying that the phenomena 
we found could hardly have occurred by chance. 
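The claim that large inner products are very unlikely for random directions in 64 dimensions can be checked empirically. The sketch below is a Monte Carlo illustration, not the derivation from Appendix C: it assumes "random vectors" means directions drawn uniformly on the unit sphere (obtained by normalizing Gaussian samples) and uses an arbitrary threshold of 0.55, chosen because it lies just below the diagonal values quoted for $M1$.

```python
import numpy as np

rng = np.random.default_rng(0)
P, trials = 64, 20000

# Normalizing i.i.d. Gaussian vectors yields directions uniform on the unit sphere.
u = rng.normal(size=(trials, P))
v = rng.normal(size=(trials, P))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)

delta = np.abs(np.sum(u * v, axis=1))   # |inner product| of each random pair
mean_abs = float(delta.mean())          # typical size of a chance alignment
frac_large = float(np.mean(delta > 0.55))  # share of pairs aligned as strongly as in M1
```

In such a simulation almost all pairs are close to orthogonal, which is the intuition behind treating several simultaneously large diagonal entries as evidence against chance.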


3.3.3 Phenomenon 3 


The phenomenon that the feature distributions of
similar classes have similar geometry only occurs




Springer Nature 2021 TEX template 


[Figure 4 image. Panel (a): curve of f_P(δ) for δ from -1 to 1, annotated with f_P(0.6545) = 6.52 × 10^-12 and f_P(0.5915) = 3.37 × 10^-9, with further markers at δ = 0.7733, 0.8423 and 0.5612. Panel (b): dog-dog. Panels (c) and (d): class-center cosine similarity heatmaps.]

Fig. 4 (a) The function curve of Equation 1. It can be 
observed that as the dimensionality increases, any two ran- 
dom vectors tend to be orthogonal to each other. (b) When 
two different models are used to extract features of dog 
separately, the geometry of the two feature distributions is 
not similar. (c) Cosine similarity between feature centers 
of classes on CIFAR-10. (d) Cosine similarity between fea- 
ture centers of classes on CIFAR-10-LT. 


when all features are extracted using the same 
model. Figure 4b shows that there is a low sim- 
ilarity between the geometry of dog computed 
by two different ResNet18 trained with random 
initialization. More examples are given in Appendix E.


3.3.4 Phenomenon 4 


We conducted further experiments on CIFAR-10 
as well as its long-tailed version CIFAR-10-LT. 
In CIFAR-10-LT, airplane, automobile, bird, and 
cat are considered head classes and the remain- 
ing classes are tail classes. As shown in Figure 4d, 
we confirmed that the most similar class to the 
tail class usually belongs to the head class [5] and 
found that if a tail class and a head class show 
high similarity in CIFAR-10-LT, they also show 
high similarity on CIFAR-10. 


3.3.5 Summary and inspiration 


Combining the above four phenomena, we propose 
the following idea: the most similar head class is 
selected for each tail class in the training process, 
and the geometry of the head class feature distri- 
bution is taken as a priori knowledge to guide and 
recover the underlying true distribution of the tail 
class. 


4 Methodology 


We first introduce how to leverage the geometric 
information of the head class feature distribution 
to model the uncertainty representation of the tail 
class features, allowing the model to learn the 
underlying true distribution of the tail class. Then 
a three-stage training scheme is proposed to apply 
the feature uncertainty representation. 


4.1 Feature Uncertainty 
Representation 
Given a tail class $t$, the head class that is most
similar to class $t$ is assumed to be $h$. The $P$-dimensional
feature embedding belonging to tail
class $t$ is $z_t = [z_t^1, \dots, z_t^{N_t}] \in \mathbb{R}^{P \times N_t}$ and the
feature embedding belonging to head class $h$ is
$z_h = [z_h^1, \dots, z_h^{N_h}] \in \mathbb{R}^{P \times N_h}$, where $N_t$ and $N_h$
denote the sample numbers of class $t$ and class $h$,
respectively. The $i$-th feature embedding of $z_t$ is
denoted by $z_t^i$. For the model to learn the underlying
distribution of the tail class $t$, we want to
utilize the existing feature embeddings to generate
feature embeddings that can cover the underlying
distribution of the tail class $t$. We therefore
propose to model the uncertainty representation
of $z_t^i$ with the geometry of the feature distribution
of class $h$, i.e., $z_t^i$ is no longer considered a
deterministic point.

The sample covariance matrix of class $h$ is
estimated as $\Sigma_h = \frac{1}{N_h} z_h z_h^T \in \mathbb{R}^{P \times P}$. The
eigenvalues of the matrix $\Sigma_h$ are denoted as
$[\lambda_h^1, \dots, \lambda_h^P] \in \mathbb{R}^P$, where $\lambda_h^1 \geq \dots \geq \lambda_h^P$. The
eigenvectors $[\xi_h^1, \dots, \xi_h^P] \in \mathbb{R}^{P \times P}$, which correspond
one-to-one with the eigenvalues, anchor
the geometry of the class $h$ feature distribution,
where $\|\xi_h^i\|_2 = 1$, $i = 1, \dots, P$. Since the distributions
of similar classes have similar geometry, we
propose to represent the uncertainty of $z_t^i$ by centering
on a single feature embedding $z_t^i$ of the tail
class $t$ and performing a random translation of $z_t^i$
along a random linear combination of $\xi_h^1, \dots, \xi_h^P$.
Considering that the "scope" of the distribution
is larger in the direction with larger eigenvalues
[51], an additional weight $\lambda_h^i$ is assigned to $\xi_h^i$
($i = 1, \dots, P$) when the eigenvectors are randomly
combined, which means that $z_t^i$ is translated farther
with higher probability in the direction with
larger eigenvalues. In summary, the final form of
the proposed method can be represented as

$$\mathrm{FUR}(z_t^i) = \overbrace{z_t^i + \sum_{j=1}^{P} \epsilon_j \lambda_h^j \xi_h^j}^{\text{Uncertainty representation of } z_t^i} \in \mathbb{R}^P, \qquad \epsilon_j \sim N(0, 1),\ j = 1, \dots, P. \tag{1}$$

$\epsilon_1, \dots, \epsilon_P$ all follow the standard Gaussian
distribution and are independent of each other,
and sampling them randomly multiple times can
produce new feature embeddings with different
translation directions and distances. In particular,
when $\epsilon_1 = 1$ and $\epsilon_2, \dots, \epsilon_P = 0$, $z_t^i$ is translated
along $\xi_h^1$ by a distance $\lambda_h^1$. Likewise, the
translation distances of $z_t^i$ along the direction
of each eigenvector individually, for a unit coefficient, are
$\lambda_h^1, \dots, \lambda_h^P$, respectively.

Our proposed method can be integrated as a 
flexible module after the feature sub-network. It 
generates augmented samples of tail classes in the 
feature space to cover the underlying distribution, 
giving the model better generalization ability on 
long-tailed data. Note that this module is only 
applied during model training and can be 
discarded during testing without affecting 
the inference speed. 
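Equation (1) can be sketched directly in NumPy. The example below is a minimal illustration under assumptions: head features are stored as columns of a `(P, N_h)` array, the covariance is computed without mean-centering as in the text, and the helper names `head_geometry` and `fur`, along with the synthetic features, are not taken from the original implementation.

```python
import numpy as np

def head_geometry(z_h):
    """Eigenvalues (descending) and eigenvectors (columns) of (1/N_h) z_h z_h^T."""
    cov = z_h @ z_h.T / z_h.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def fur(z, lam_h, xi_h, n_aug, rng):
    """Return n_aug perturbed copies of one tail feature z, shape (P, n_aug):
    z + sum_j eps_j * lambda_h^j * xi_h^j with eps_j ~ N(0, 1)."""
    eps = rng.standard_normal((lam_h.shape[0], n_aug))   # independent coefficients
    shift = xi_h @ (lam_h[:, None] * eps)                # weighted eigenvector combination
    return z[:, None] + shift

rng = np.random.default_rng(0)
P = 8
scales = np.linspace(3.0, 0.2, P)
z_h = rng.normal(size=(P, 400)) * scales[:, None]   # synthetic head-class features
lam_h, xi_h = head_geometry(z_h)

z_t = rng.normal(size=P)                            # one synthetic tail feature
augmented = fur(z_t, lam_h, xi_h, n_aug=5000, rng=rng)

# Displacement of the augmented features expressed in the head eigenbasis.
proj = xi_h.T @ (augmented - z_t[:, None])
spread_first, spread_last = float(proj[0].std()), float(proj[-1].std())
```

The spread along the leading eigenvector is much larger than along the last one, which is the intended effect of weighting each direction by its eigenvalue.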


4.2 Training Scheme 


We propose a three-stage training scheme to apply 
feature uncertainty representation so that the 
model learns information outside the observed 
domain. Decoupled training is adopted for the first 
two phases. In Phase 1, the long-tailed dataset is 
used to learn the feature sub-network and classi- 
fier. In Phase 2, the uncertainty representation of 
the tail feature is applied to generate new sam- 
ples for reshaping the decision boundaries. Unlike 
decoupled training, we additionally add Phase 3 
to fine-tune the feature sub-network to adapt it to 
the new decision boundaries. 


• Phase-1: Initialization training. Represent
an end-to-end deep neural network as a combination
of a feature sub-network and a classifier:
$M = \{f(x, \theta_1), g(z, \theta_2)\}$, where $\theta_1$ and $\theta_2$ are
the parameters of the network. We utilize all
images from the dataset to learn the feature
sub-network $f(x, \theta_1)$ as well as the classifier
$g(z, \theta_2)$. After training is completed, the head


classes that are most similar to each tail class 
are selected based on the average prediction 
score of the model (see Definition 3), and the 
geometry of the feature distribution of these 
head classes is represented by the eigenvec- 
tors of the covariance matrix, which will be 
applied to guide the recovery of the tail class 
distribution. 

• Phase-2: Reshaping decision boundaries.
Freeze the parameters of $f(x, \theta_1)$ and employ
feature uncertainty representation in feature
space for the tail class to fine-tune the classifier
to improve the performance of the tail class.
Specifically, in each iteration, we randomly sample
$N_T$ images from the tail class, and then
generate $N_A$ augmented samples for each true
sample by feature uncertainty representation.
Meanwhile, to balance the sample distribution,
we directly sample $N_T(1 + N_A)$ samples randomly
from the head class. The tail class samples
and the head class samples together form a
mini-batch containing $2N_T(1 + N_A)$ samples for
fine-tuning the classifier. The $N_A$ and $N_T$ settings
are related to the batch size, and they are
described in detail in Section 5.2.

• Phase-3: Fine-tuned feature sub-network.
Fine-tuning the decision boundary can improve
the performance of the tail class while compromising
the performance of the head class
[43]. This is because the feature sub-network is
not well adapted to the new decision boundary.
Therefore, we propose to freeze the parameters
of $g(z, \theta_2)$ at Phase-3 and fine-tune $f(x, \theta_1)$ with
the original long-tailed data.


The above three-phase training process is summa- 
rized in Algorithm 1. 
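
The Phase-2 perturbation of tail features can be illustrated with a minimal NumPy sketch. It assumes that each tail feature is shifted by a random combination of the eigenvectors of the covariance matrix of its most similar head class, with standard normal coefficients; any additional scaling the authors apply is not reproduced here. The function names `head_geometry` and `perturb_tail_features` are hypothetical and do not come from the released code.

```python
import numpy as np


def head_geometry(head_features):
    """Eigenvectors of the head-class feature covariance.

    head_features: array of shape (num_samples, P).
    Returns a (P, P) matrix whose columns are eigenvectors sorted by
    decreasing eigenvalue.
    """
    cov = np.cov(head_features, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]


def perturb_tail_features(tail_features, head_eigvecs, n_aug, rng):
    """Generate n_aug perturbed copies of every tail feature.

    Assumption of this sketch: perturbation = sum_j eps_j * xi_j with
    eps_j ~ N(0, 1) and xi_j the j-th head-class eigenvector.
    """
    n_tail, dim = tail_features.shape
    eps = rng.standard_normal((n_tail, n_aug, dim))
    noise = eps @ head_eigvecs.T  # combines the eigenvector columns
    augmented = tail_features[:, None, :] + noise
    return augmented.reshape(n_tail * n_aug, dim)


rng = np.random.default_rng(0)
head = rng.standard_normal((200, 8))   # stand-in head-class features
tail = rng.standard_normal((4, 8))     # N_T = 4 stand-in tail features
eigvecs = head_geometry(head)
augmented = perturb_tail_features(tail, eigvecs, n_aug=3, rng=rng)
tail_part = np.concatenate([tail, augmented])  # N_T * (1 + N_A) features
```

With $N_T = 4$ and $N_A = 3$, the tail part of the mini-batch holds 16 features, which would be matched by 16 head-class samples to give $2N_T(1 + N_A) = 32$ samples in total.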


5 Experiments 


5.1 Datasets and Evaluation Metrics 


We evaluate the effectiveness and generalizability of our approach on CIFAR-10/100-LT [2, 15], ImageNet-LT [22], and iNaturalist2018 [32]. For a fair comparison, we use the official training and test splits of all datasets [42, 46], and the Top-1 accuracy on the test set is utilized as the performance metric.


Algorithm 1 Feature Uncertainty Representation

Require: A long-tailed dataset $D$ containing $S$ samples. A CNN network $M = \{f(x, \theta_1), g(z, \theta_2)\}$, where $\theta_1$ and $\theta_2$ denote the parameters of the feature sub-network and classifier, respectively, and $x$ and $z$ denote the input and feature embedding of the model, respectively.

1: for epoch = 1 to $m_1$ do
2:     Train model $M$ on dataset $D$ without using any class rebalancing method.
3: end for
4: Using $M$, the head classes that are most similar to each tail class are calculated, and then the sample covariance matrix of these head classes is calculated.
5: for epoch = $m_1$ to $m_2$ do
6:     Freeze the parameters $\theta_1$ of the feature sub-network.
7:     for iteration = 0 to $S / \text{batch size}$ do
8:         A mini-batch $\{(x_i, y_i)\}_{i=1}^{\text{batch size}}$ is sampled from $D$, where the sample numbers from the tail class are $N_T$ and the sample numbers from the head class are $N_T(1 + N_A)$.
9:         Compute the feature embedding:
10:        $z_i = f(x_i, \theta_1)$, $i = 1, \dots, (2N_T + N_T N_A)$.
11:        Uncertainty representation of all features from tail classes: $FUR(z_i^t) = z_i^t + \sum_{j=1}^{P} \epsilon_j \xi_j^h \in \mathbb{R}^P$, $t \in$ tail class, $i \in \mathrm{int}[0, N_T]$. $h$ denotes the head class most similar to $t$.
12:        $\epsilon_j \sim N(0, 1)$, $j = 1, \dots, P$. $N_A$ augmented features are generated for the true features of each tail class by randomly sampling $\epsilon_j$ ($j = 1, \dots, P$) $N_A$ times.
13:        A mini-batch with a balanced distribution containing $2N_T(1 + N_A)$ samples is obtained.
14:        Compute the cross-entropy loss $L(g(z_i, \theta_2), y_i)$ and update the parameters of the classifier:
15:        $\theta_2 = \theta_2 - \alpha \nabla_{\theta_2} L(g(z_i, \theta_2), y_i)$.
16:    end for
17: end for
18: for epoch = $m_2$ to $m_3$ do
19:    Freeze the parameter $\theta_2$ of $g(z, \theta_2)$.
20:    Fine-tune the parameters of the feature sub-network using the long-tailed dataset $D$.
21: end for

• Both CIFAR-10 and CIFAR-100 [15] contain 60,000 images, of which 50,000 are used for training and 10,000 for validation, and they contain 10 and 100 classes, respectively. For a fair comparison, we use the long-tailed version of the CIFAR dataset. The imbalance factor (IF) is defined as the number of training samples in the most frequent class divided by the number of training samples in the least frequent class. The imbalance factors we employ in our experiments are 10, 50, 100, and 200.

• ImageNet-LT is an artificially produced unbalanced dataset derived from its balanced version (ImageNet-2012 [28]). With an imbalance factor of 256, it contains 1000 classes totaling 115.8k images, with a maximum of 1280 images and a minimum of 5 images per class.

• The iNaturalist2018 dataset is a large-scale real-world dataset that exhibits a long tail. It contains 437,513 training samples from 8,142 classes with an imbalance factor of 500 and three validation samples per class.
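
As a small illustration of the imbalance factor defined above, the following sketch computes it from per-class training counts. The function name `imbalance_factor` is hypothetical, and the intermediate count of 300 is invented for the example; only the maximum of 1280 and the minimum of 5 come from the ImageNet-LT description.

```python
def imbalance_factor(class_counts):
    """Ratio of the largest to the smallest per-class training count."""
    counts = list(class_counts)
    if not counts or min(counts) <= 0:
        raise ValueError("every class needs at least one training sample")
    return max(counts) / min(counts)


# ImageNet-LT extremes from the text: 1280 images at most, 5 at least.
imagenet_lt_if = imbalance_factor([1280, 300, 5])
```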


5.2 Implementation Details 


Following the accepted settings [6, 45, 49], the batch sizes on ImageNet-LT and iNaturalist2018 were taken to be 256 and 512, respectively. For a fair comparison, we are consistent with OFA [5] and take $N_A$ to be 3, so $N_T$ is 32 and 64 on ImageNet-LT and iNaturalist2018, respectively. More details of the experimental setup are listed in Table 2. We trained models on CIFAR-10-LT and CIFAR-100-LT using a single NVIDIA 2080Ti GPU and on ImageNet-LT and iNaturalist2018 using 4 NVIDIA 2080Ti GPUs.
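
The relation between the batch size, $N_A$, and $N_T$ follows from the Phase-2 mini-batch size $2N_T(1 + N_A)$. A short sketch of the arithmetic, with the hypothetical helper name `tail_samples_per_batch`:

```python
def tail_samples_per_batch(batch_size, n_aug):
    """Solve batch_size = 2 * N_T * (1 + N_A) for N_T."""
    per_tail_sample = 2 * (1 + n_aug)
    if batch_size % per_tail_sample != 0:
        raise ValueError("batch size is not divisible by 2 * (1 + N_A)")
    return batch_size // per_tail_sample


n_t_imagenet = tail_samples_per_batch(256, 3)      # ImageNet-LT
n_t_inaturalist = tail_samples_per_batch(512, 3)   # iNaturalist2018
```

Both values agree with the settings stated above ($N_T = 32$ and $N_T = 64$).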


5.3 Comparative Methods 


We train the proposed Feature Uncertainty Representation (FUR) employing decoupled training and three-stage training schemes, respectively. The FUR is then compared with classical and state-of-the-art long-tailed knowledge transfer methods, non-transfer data augmentation methods, and other state-of-the-art long-tailed recognition methods. The specific methods are classified as follows.

• Classical and latest long-tailed knowledge transfer methods, including OFA [5], M2m [14], RSG [34], GistNet [20], and CMO [26].

• Other state-of-the-art methods. They include the two-stage MiSLAS [49], DisAlign [45], BBN [50] with two branches, and the non-transfer augmentation methods UniMix [41], MetaSAug [18], CDB [29] and GCL [17].

Table 1 Comparison on CIFAR-10-LT and CIFAR-100-LT with a ResNet-32 backbone. The Top-1 accuracy (%) is reported. The best and second-best results are shown in underlined bold and bold, respectively. FUR-Decoupled indicates FUR with decoupled training, and FUR (default) indicates FUR with the three-stage training scheme.

| Method | Pub. | CIFAR-10-LT IF=200 | IF=100 | IF=50 | IF=10 | CIFAR-100-LT IF=200 | IF=100 | IF=50 | IF=10 |
|---|---|---|---|---|---|---|---|---|---|
| Cross Entropy | - | 65.6 | 70.3 | 74.8 | 86.3 | 34.8 | 38.2 | 43.8 | 55.7 |
| BBN [50] | CVPR 2020 | - | 79.8 | 82.1 | 88.3 | - | 42.5 | 47.0 | 59.1 |
| UniMix [41] | NeurIPS 2021 | 78.5 | 82.8 | 84.3 | 89.7 | 42.1 | 45.5 | 51.1 | 61.3 |
| MetaSAug [18] | CVPR 2021 | 76.8 | 80.5 | 84.0 | 89.4 | 39.9 | 46.8 | 51.9 | 61.7 |
| MiSLAS [49] | CVPR 2021 | - | 82.1 | 85.7 | 90.0 | - | 47.0 | 52.3 | 63.2 |
| CDB-W-CE [29] | IJCV 2022 | - | - | - | - | - | 42.6 | - | 58.7 |
| GCL [17] | CVPR 2022 | 79.0 | 82.7 | 85.5 | - | 44.9 | 48.7 | 53.6 | - |
| OFA [5] | ECCV 2020 | 75.5 | 82.0 | 84.4 | 91.2 | 41.4 | 48.5 | 52.1 | 65.3 |
| M2m [14] | CVPR 2020 | - | 78.3 | - | 87.9 | - | 42.9 | - | 58.2 |
| RSG [34] | CVPR 2021 | - | 79.6 | 82.8 | - | - | 44.6 | 48.5 | - |
| CMO [26] | CVPR 2022 | - | - | - | - | - | 50.0 | 53.0 | 60.2 |
| FUR-Decoupled | - | 79.6 | 83.4 | 86.1 | 90.7 | 45.8 | 50.7 | 53.9 | 61.4 |
| FUR | - | 79.8 | 83.7 | 86.2 | 90.9 | 46.2 | 50.9 | 54.1 | 61.8 |

Table 2 Details of the experimental setup. The 100+50+50 in Epoch indicates the first phase, the second phase, and the third phase are trained for 100, 50, and 50 epochs, respectively.

| Setting | CIFAR-10/100-LT | ImageNet-LT | iNaturalist2018 |
|---|---|---|---|
| LR (Phase 1) | 0.05 | 0.1 | 0.1 |
| LR (Phase 2) | 0.001 | 0.001 | 0.001 |
| LR (Phase 3) | 0.001 | 0.001 | 0.001 |
| LR decay | Cosine | Linear | Linear |
| Batch size | 128 | 256 | 512 |
| Warm-up | ✓ | ✗ | ✗ |
| Backbone | ResNet-32 | ResNeXt-50 | ResNet-50 |
| Epoch | 100+50+50 | 100+50+50 | 100+50+50 |

5.4 Results on CIFAR-10-LT and CIFAR-100-LT

The results on CIFAR-10-LT and CIFAR-100-LT are summarized in Table 1, where our proposed method achieves the best performance on six long-tailed CIFAR datasets and second-best results on the remaining two datasets. Our proposed FUR outperforms GCL by 1% and 2.2% on CIFAR-10-LT and CIFAR-100-LT with IF=100, respectively. On the CIFAR-100-LT with IF=10, FUR-Decoupled outperforms the combined CMO by 1.6%. FUR with a three-stage training scheme outperforms FUR-Decoupled on all datasets, which we will analyze in detail in the next part of the experiment.

Compared to CMO, which randomly pastes the image foreground of the tail class onto the background of the head class image, our proposed FUR relies on the observed prior knowledge to recover the underlying distribution of the tail class. GCL constructs the same “feature cloud” for each feature of the tail class to adjust the model logit, without taking into account the differences in domain characteristics between classes. As a result, FUR outperforms similar methods on multiple datasets.

Table 3 Top-1 accuracy (%) of ResNeXt-50 on ImageNet-LT and Top-1 accuracy (%) of ResNet-50 on iNaturalist2018 for classification. The best and the second-best results are shown in underlined bold and bold, respectively. FUR-Decoupled indicates FUR with decoupled training, and FUR (default) indicates FUR with the three-stage training scheme.

| Methods | Pub. | ImageNet-LT Head | Middle | Tail | Overall | iNaturalist 2018 Head | Middle | Tail | Overall |
|---|---|---|---|---|---|---|---|---|---|
| BBN [50] | CVPR 2020 | 43.3 | 45.9 | 43.7 | 44.7 | 49.4 | 70.8 | 65.3 | 66.3 |
| DisAlign [45] | CVPR 2021 | 59.9 | 49.9 | 31.8 | 52.9 | 68.0 | 71.3 | 69.4 | 70.2 |
| UniMix [41] | NeurIPS 2021 | - | - | - | 48.4 | - | - | - | 69.2 |
| MetaSAug [18] | CVPR 2021 | - | - | - | 47.3 | - | - | - | 68.7 |
| MiSLAS [49] | CVPR 2021 | 65.3 | 50.6 | 33.0 | 53.4 | 73.2 | 72.4 | 70.4 | 71.6 |
| CDB-W-CE [29] | IJCV 2022 | - | - | - | 38.5 | - | - | - | - |
| GCL [17] | CVPR 2022 | - | - | - | 54.9 | - | - | - | 72.0 |
| OFA [5] | ECCV 2020 | 47.3 | 31.6 | 14.7 | 35.2 | - | - | - | 65.9 |
| RSG [34] | CVPR 2021 | 63.2 | 48.2 | 32.3 | 51.8 | - | - | - | 70.2 |
| GistNet [20] | ICCV 2021 | 52.8 | 39.8 | 21.7 | 42.2 | - | - | - | 70.8 |
| BS + CMO [26] | CVPR 2022 | 62.0 | 49.1 | 36.7 | 52.3 | 68.8 | 70.0 | 72.3 | 70.9 |
| FUR-Decoupled | - | 65.1 | 51.6 | 38.3 | 55.2 | 73.4 | 72.5 | 73.7 | 72.4 |
| FUR | - | 65.4 | 52.2 | 37.8 | 55.5 | 73.6 | 72.9 | 73.1 | 72.6 |


5.5 Results on ImageNet-LT and 
iNaturalist2018 


We report in Table 3 not only the overall performance of FUR and FUR-Decoupled on ImageNet-LT and iNaturalist2018 but also the performance on three subsets of these two datasets: Head (more than 100 images), Middle (20-100 images), and Tail (fewer than 20 images). Compared to other methods, FUR achieves state-of-the-art overall performance on both ImageNet-LT and iNaturalist2018.
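
The Head/Middle/Tail split can be expressed as a short sketch. The thresholds follow the definitions above; the helper names and the choice to pool test samples within each subset (rather than averaging per-class accuracies) are assumptions made for illustration only.

```python
from collections import defaultdict


def subset_of(train_count):
    """Assign a class to Head, Middle, or Tail by its training-set size."""
    if train_count > 100:
        return "Head"
    if train_count >= 20:
        return "Middle"
    return "Tail"


def subset_accuracy(correct, total, train_count):
    """Pooled Top-1 accuracy per subset.

    correct, total, train_count: dicts keyed by class id holding the number
    of correctly classified test images, the number of test images, and the
    number of training images of that class.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for cls, n_train in train_count.items():
        name = subset_of(n_train)
        hits[name] += correct[cls]
        seen[name] += total[cls]
    return {name: hits[name] / seen[name] for name in seen if seen[name] > 0}


# Invented toy numbers, used only to exercise the functions.
example = subset_accuracy(
    correct={0: 90, 1: 30, 2: 2},
    total={0: 100, 1: 50, 2: 10},
    train_count={0: 500, 1: 50, 2: 5},
)
```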

We argue that although the bias of the classifier is mitigated after decoupled training, decoupled training ignores whether the feature sub-network can adapt to the new decision boundaries, which leads to a trade-off in the performance of the head classes. Therefore, we add a third stage to fine-tune the feature extractor to adapt it to the latest decision boundaries. FUR-Decoupled outperforms the transfer-based CMO by 3.4% and 4.8%, respectively, on the Head subset of the two large-scale long-tailed datasets, benefiting from the fact that FUR relies on prior knowledge rather than randomly recovering the tail class distribution. Although there is a slight degradation in tail class performance after the third stage, the overall performance of the model and the performance on the head subset are better than those of the decoupled trained model. Thus both the feature sub-network and the classifier need to be fine-tuned to rebalance the preferences of the model. In addition, the strong performance of FUR-Decoupled on tail classes suggests that our method can recover the underlying distribution of tail classes more efficiently.

Fig. 5 Visualization of tail class feature embedding from CIFAR-10-LT with an imbalance factor of 200.


5.6 Visualization Analysis 


To demonstrate that FUR recovers the underlying distribution of tail classes, we visualized the tail features of CIFAR-10-LT via t-SNE. As shown in Figure 5, the training distribution after augmentation with FUR covers the test distribution well. This result further indicates that our proposed method recovers the underlying distribution of the tail classes, allowing the model to perform better on the test set outside the training domain.


6 Conclusion 


In this work, we discovered four fundamental phenomena regarding the relationship between the geometry of feature distributions, which provide a theoretical and experimental basis for subsequent studies of class imbalance. Inspired by the four phenomena, we propose feature uncertainty representation (FUR) with geometric information for recovering the true distribution of tail classes. After three stages of training, the experimental results show that our proposed method greatly improves the performance of the tail class compared to other methods while maintaining strong performance on the head class.


7 Data Availability Statements


All datasets used in this study are open-access and 
have been cited in the paper. 


Appendix A 


To facilitate the analysis and understanding of the geometry of the feature distributions and the similarity between the geometries, four two-dimensional distributions were generated and plotted in Figure A1 and Figure A2. The geometry of the feature distribution is first introduced. As shown in Figure A1, the direction $\xi_1$ with the largest variance and the direction $\xi_2$ with the largest variance among the directions orthogonal to $\xi_1$ are selected. It can be seen that the geometry and location of the distribution can be anchored by $\xi_1$, $\xi_2$, and the center of the distribution. It is important to note that in this work, we only focus on the shape of the distribution and ignore the location of the distribution. Moreover, if the projection is done along these two directions, the information of the distribution is preserved to the maximum extent.

Observing Figure A1, we can notice that the geometries of the red and blue distributions are more similar because their covariances are similar, i.e., the pattern of variation of the vertical axis with the horizontal axis is similar. The direction of the maximum variance mainly determines the shape of the distribution, and the direction of the second largest variance also plays a role in the shape of the distribution. Clearly, the geometry of the two distributions in Figure A1 is more similar than in Figure A2.

Fig. A2 Two distributions with low geometry similarity.
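
The two-dimensional picture can be reproduced with a small NumPy sketch. It extracts the directions of largest and second-largest variance from the sample covariance and compares them between distributions. The alignment score used here (absolute cosine between corresponding directions) is only an illustrative stand-in; the similarity measure used in this work is the one defined in the main text. All function names and the example covariance are assumptions of the sketch.

```python
import numpy as np


def principal_directions(samples):
    """Variances and directions of a point cloud, largest variance first."""
    centered = samples - samples.mean(axis=0)  # location is ignored
    cov = centered.T @ centered / (len(samples) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def direction_alignment(dirs_a, dirs_b):
    """Absolute cosine between corresponding directions (columns).

    The absolute value is taken because the sign of an eigenvector is arbitrary.
    """
    return np.abs(np.sum(dirs_a * dirs_b, axis=0))


rng = np.random.default_rng(0)
cov = np.array([[3.0, 1.5], [1.5, 1.0]])
cov_rotated = np.array([[1.0, -1.5], [-1.5, 3.0]])  # cov rotated by 90 degrees

red = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
blue = rng.multivariate_normal([4.0, -2.0], cov, size=5000)       # same shape, shifted
green = rng.multivariate_normal([0.0, 0.0], cov_rotated, size=5000)

_, dirs_red = principal_directions(red)
_, dirs_blue = principal_directions(blue)
_, dirs_green = principal_directions(green)

similar = direction_alignment(dirs_red, dirs_blue)      # close to 1
dissimilar = direction_alignment(dirs_red, dirs_green)  # close to 0
```

Shifting a distribution leaves its directions unchanged, which matches the statement that only the shape, not the location, is considered.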


Appendix B 


The CIFAR-10 dataset consists of 60,000 images of size 32 × 32 from 10 classes. Each class contains 6,000 images, of which 5,000 are used for training and 1,000 for testing. Fashion MNIST has ten classes, each containing 6,000 training images and 1,000 test images of size 28 × 28.

To improve the generalization ability of the model and prevent it from overfitting the training set, we perform three data augmentation operations on the training set: random flip, random crop, and Cutout. Cutout keeps the model from being overly dependent on certain areas of the image by randomly masking out parts of the image. Because the images are small, the 7 × 7 convolutional kernel of ResNet-18 and the pooling operation tend to lose spatial information, so we remove the max pooling layer and replace the 7 × 7 convolutional kernel with a 3 × 3 convolutional kernel.

Fig. B3 Class accuracy of ResNet-18 on Fashion MNIST and CIFAR-10. The class indexes in Figure 2 correspond to the ten numbers of MNIST. For Fashion MNIST and CIFAR-10, the class indexes (i.e., 1 to 10) correspond to the classes from left to right in the above figure, respectively.


We adopt SGD to optimize the model, with a batch size of 128 and an initial learning rate of 0.1. If the loss does not decrease for 10 consecutive epochs, the learning rate is halved. We train for a total of 250 epochs. ResNet-18 achieved an accuracy of 93.46% on CIFAR-10 and 94.82% on Fashion MNIST. The accuracy for each class is plotted in Figure B3.
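
The learning-rate rule described above can be sketched in plain Python. The sketch assumes that "does not decrease" is measured against the best loss seen so far and that the patience counter resets after each reduction; the actual training code may differ in both respects. The function name `plateau_schedule` is hypothetical.

```python
def plateau_schedule(losses, initial_lr=0.1, factor=0.5, patience=10):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` once the loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    rates = []
    for loss in losses:
        rates.append(lr)          # rate in effect during this epoch
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return rates


# A loss that never improves after the first epoch triggers one halving.
rates = plateau_schedule([1.0] * 12)
```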


Appendix C 


In the following, we derive the probability density function of the inner product of two random vectors. Without loss of generality, we set $x$ to be a $P$-dimensional random unit vector and fix $y$ to be a unit vector, i.e.

$$x = (x_1, x_2, \dots, x_P), \quad y = (1, 0, \dots, 0).$$

The above equation satisfies $x_1^2 + x_2^2 + \cdots + x_P^2 = 1$. Using the spherical transformation (writing $n$ for the dimension $P$)

$$
\begin{aligned}
x_1 &= r \cos\varphi_1, \\
x_2 &= r \sin\varphi_1 \cos\varphi_2, \\
x_3 &= r \sin\varphi_1 \sin\varphi_2 \cos\varphi_3, \\
&\;\;\vdots \\
x_{n-1} &= r \sin\varphi_1 \sin\varphi_2 \cdots \sin\varphi_{n-2} \cos\varphi_{n-1}, \\
x_n &= r \sin\varphi_1 \sin\varphi_2 \cdots \sin\varphi_{n-2} \sin\varphi_{n-1},
\end{aligned}
$$

where

$$0 \le r < +\infty, \quad 0 \le \varphi_1 \le \pi, \quad \dots, \quad 0 \le \varphi_{n-2} \le \pi, \quad 0 \le \varphi_{n-1} \le 2\pi.$$

The Jacobi determinant of the above transformation is

$$J = \frac{\partial(x_1, x_2, \dots, x_n)}{\partial(r, \varphi_1, \dots, \varphi_{n-1})} = r^{n-1} \sin^{n-2}\varphi_1 \sin^{n-3}\varphi_2 \cdots \sin\varphi_{n-2}.$$

Since $x$ is a unit vector, $r = 1$. Notice that $\langle x, y \rangle = x_1 = \cos\varphi_1$, so $\cos(x, y) = \cos\varphi_1$. According to the geometric probability,

$$
P_n(\varphi_1 \le \theta)
= \frac{\int_0^{\theta} \sin^{n-2}\varphi_1 \, d\varphi_1 \cdots \int_0^{\pi} \sin\varphi_{n-2} \, d\varphi_{n-2} \int_0^{2\pi} d\varphi_{n-1}}
       {\int_0^{\pi} \sin^{n-2}\varphi_1 \, d\varphi_1 \cdots \int_0^{\pi} \sin\varphi_{n-2} \, d\varphi_{n-2} \int_0^{2\pi} d\varphi_{n-1}}
= \frac{S_{n-1}}{S_n} \int_0^{\theta} \sin^{n-2}\varphi_1 \, d\varphi_1,
$$

where $S_n$ denotes the surface area of the $n$-dimensional unit sphere. When $k$ is a positive integer,

$$\int_0^{\pi} \sin^{k-1}\varphi \, d\varphi = 2 \int_0^{\pi/2} \sin^{k-1}\varphi \, d\varphi,$$

and because

$$
\int_0^{\pi/2} \sin^{n}\varphi \, d\varphi =
\begin{cases}
\dfrac{(2m-1)!!}{(2m)!!} \cdot \dfrac{\pi}{2}, & n = 2m, \\[2ex]
\dfrac{(2m)!!}{(2m+1)!!}, & n = 2m + 1,
\end{cases}
$$

the expression of $S_n$ is obtained. For convenience, we can use the $\Gamma$ function to unify the two, then

$$S_n = \frac{2\pi^{n/2}}{\Gamma\left(\frac{n}{2}\right)}.$$

We can obtain

$$P_n(\varphi_1 \le \theta) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\sqrt{\pi}} \int_0^{\theta} \sin^{n-2}\varphi_1 \, d\varphi_1.$$

Further, the probability density function of $\theta$ is calculated as

$$f_n(\theta) = \frac{d}{d\theta} P_n(\varphi_1 \le \theta) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\sqrt{\pi}} \sin^{n-2}\theta.$$

We plot the curve of the function $f_n(\theta)$ in Figure C4. It can be seen that the angle between the two random vectors tends to $\pi/2$ as the dimensionality increases.

Fig. C4 The probability density function of the angle between two high-dimensional random vectors.

Let $\delta = \cos\theta$. Then the probability density function of $\delta$ is

$$f_n(\delta) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\sqrt{\pi}} \left(1 - \delta^2\right)^{\frac{n-3}{2}}, \quad \delta \in [-1, 1].$$

In the main text, the dimensionality of the feature vector is denoted by $P$. We plot the curve of the function $f_n(\delta)$ in Figure C5. It can be seen that the inner product between two high-dimensional random vectors tends to 0 as the dimensionality increases, which means that the two random vectors tend to be orthogonal. The above results show that our findings did not happen by chance and that the experimental phenomena we summarized are reliable.

Fig. C5 The probability density function of the inner product between two high-dimensional random vectors.
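
The concentration of the inner product around 0 can be checked numerically. The sketch below evaluates the angle density $\frac{\Gamma(n/2)}{\Gamma((n-1)/2)\sqrt{\pi}} \sin^{n-2}\theta$ and draws random unit vectors by normalizing standard normal samples, a standard way to sample uniformly on the sphere. The function names and the dimensions 3 and 512 are choices made for this illustration, not values taken from the experiments.

```python
import math

import numpy as np


def angle_density(theta, n):
    """Density of the angle between a random unit vector in R^n and a fixed one."""
    log_const = (
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2) - 0.5 * math.log(math.pi)
    )
    return math.exp(log_const) * np.sin(theta) ** (n - 2)


def sample_inner_products(n, num_samples, rng):
    """Inner products <x, y> for uniform unit vectors x and y = (1, 0, ..., 0)."""
    x = rng.standard_normal((num_samples, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x[:, 0]


rng = np.random.default_rng(0)
low_dim = sample_inner_products(3, 20000, rng)
high_dim = sample_inner_products(512, 20000, rng)

# By symmetry E[x_1^2] = 1/n, so the spread shrinks like 1/sqrt(n).
spread_low = low_dim.std()
spread_high = high_dim.std()
```

The spread of the sampled inner products shrinks as the dimension grows, in line with the statement that two high-dimensional random vectors tend to be orthogonal.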


Appendix D 


Figure 3 shows the similarity of each class to the 
other classes on Fashion MNIST and CIFAR-10, 
and is sorted in descending order of similarity. In 
Table C1, we list in detail the name of each class 
in Figure 3a and Figure 3b. 

In addition to the geometry similarity between 
the feature distributions of dog and cat shown 
in Figure 3c, we also plot the geometry similar- 
ity between other classes with high similarity in 
Figure D6. This evidence strongly suggests that 
our findings are not accidental. 


Appendix E 


In this section, we provide additional experimental results for phenomenon 3. Features of all classes in CIFAR-10 were extracted using two ResNet-18 models trained with different initialization parameters, and then the geometry similarity between the feature distributions of the same class extracted by different models was calculated. All additional experimental results are plotted in Figure D6, where it can be observed that the same class of features extracted by the different models does not match phenomenon 2 at all.

Table C1 Details of all classes in Figure 3a and Figure 3b. Each row lists, for the class in the first column, the other nine classes together with their similarity to it.

Fashion MNIST

| Class | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| T-shirt | Shirt 3.63 | Pullover 2.56 | Dress 1.27 | Bag 0.69 | Trouser 1.15 | Sneaker 0.93 | Ankle 1.28 | Sandal 1.16 | Coat 1.02 |
| Trouser | Dress 2.02 | Coat 1.52 | Bag 1.56 | Pullover 1.52 | Shirt 1.57 | T-shirt 1.15 | Ankle 0.83 | Sneaker 0.86 | Sandal 0.64 |
| Pullover | Shirt 3.42 | Coat 2.62 | T-shirt 2.56 | Dress 1.20 | Trouser 1.52 | Bag 1.52 | Sandal 1.48 | Ankle 0.71 | Sneaker 0.36 |
| Dress | Coat 2.58 | Trouser 2.02 | Shirt 1.97 | T-shirt 1.27 | Pullover 1.20 | Bag 1.32 | Ankle 1.24 | Sandal 0.58 | Sneaker 0.48 |
| Coat | Pullover 2.62 | Dress 2.58 | Shirt 1.96 | Bag 0.99 | Trouser 1.52 | Ankle 1.03 | Sneaker 1.22 | T-shirt 1.02 | Sandal 0.72 |
| Sandal | Sneaker 3.29 | Ankle 1.13 | Bag 1.67 | Dress 0.58 | Shirt 0.69 | Trouser 0.64 | T-shirt 1.16 | Pullover 1.48 | Coat 0.72 |
| Shirt | T-shirt 3.63 | Pullover 3.42 | Coat 1.96 | Dress 1.97 | Bag 1.36 | Trouser 1.57 | Ankle 1.41 | Sandal 0.69 | Sneaker 0.41 |
| Sneaker | Sandal 3.29 | Ankle 2.65 | Coat 1.22 | T-shirt 1.07 | Trouser 0.86 | Bag 1.24 | Dress 0.48 | Shirt 0.41 | Pullover 0.36 |
| Bag | Sandal 1.67 | Dress 1.32 | Shirt 1.36 | Coat 0.99 | Pullover 1.52 | T-shirt 0.69 | Ankle 1.06 | Trouser 1.56 | Sneaker 1.24 |
| Ankle | Sneaker 2.65 | Sandal 1.13 | Bag 1.06 | Coat 1.03 | Dress 1.24 | Shirt 1.41 | T-shirt 1.28 | Trouser 0.83 | Pullover 0.71 |

CIFAR-10

| Class | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| airplane | bird 3.48 | truck 3.05 | ship 1.70 | cat 1.20 | horse 0.81 | deer 1.94 | automobile 0.81 | frog 1.64 | dog 1.43 |
| automobile | truck 2.52 | ship 1.32 | airplane 0.81 | frog 0.54 | cat 0.65 | horse 0.84 | bird 1.00 | dog 0.49 | deer 1.14 |
| bird | airplane 3.43 | deer 2.83 | cat 2.28 | dog 1.47 | frog 1.39 | horse 1.45 | ship 1.18 | automobile 1.00 | truck 0.56 |
| cat | dog 3.04 | bird 2.28 | deer 2.17 | frog 1.93 | horse 0.96 | airplane 1.20 | ship 0.86 | truck 0.90 | automobile 0.65 |
| deer | bird 2.83 | cat 2.17 | horse 1.54 | dog 1.46 | frog 1.14 | airplane 1.94 | ship 1.13 | automobile 1.14 | truck 0.46 |
| dog | cat 3.04 | horse 2.44 | bird 1.47 | deer 1.46 | frog 0.74 | airplane 1.43 | truck 1.16 | ship 1.03 | automobile 0.49 |
| frog | cat 1.93 | bird 1.39 | deer 1.14 | dog 0.74 | airplane 1.64 | ship 0.90 | horse 1.22 | automobile 0.54 | truck 0.83 |
| horse | dog 2.44 | deer 1.54 | cat 0.96 | bird 1.45 | airplane 0.81 | truck 0.85 | frog 1.22 | ship 1.02 | automobile 0.84 |
| ship | airplane 1.70 | automobile 1.32 | truck 1.18 | cat 0.86 | bird 1.18 | frog 0.90 | deer 1.13 | horse 1.02 | dog 1.03 |
| truck | airplane 3.05 | automobile 2.52 | ship 1.18 | horse 0.85 | cat 0.90 | dog 1.16 | bird 0.56 | frog 0.83 | deer 0.46 |